MBD @AP2022
CAPSTONE PROJECT
Group A

The aim of this notebook is to define Churn in ClientCo® clients based on the predictions obtained from a combination of analyses and models:
Trend Analysis | RFM Analysis | CLV | K-Means | Isolation Forest
After the analysis, a Classification Machine Learning model will be implemented to predict expected Churn.
+ Info: https://blackboard.ie.edu/ultra/courses/_46938_1/outline/file/_683290_1
Target: [Churn], churn or not.
The data comes in .csv format, converted to a more efficient .parquet.
Let's start off with the data.
First of all, we have a single dataset:
data.parquet

Our data is not split for us. We will use 80% of the complete dataset for training, with a posterior K-Fold CV to tune the hyperparameters. The remaining 20% will be used as the test set, to measure how well our model generalizes. This is the strategy we will apply.
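In plain scikit-learn terms (our mlTools wraps something similar), the strategy can be sketched as follows, with placeholder arrays standing in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Illustrative data standing in for data.parquet
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Fixed 80/20 split: 20% is held out to measure generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# K-Fold CV on the training portion only, for hyperparameter tuning
kf = KFold(n_splits=5, shuffle=True, random_state=42)
n_folds = sum(1 for _ in kf.split(X_train))
```
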
We will import the self-made mlTools package, in order to import, split, and explore our dataset with ease:
SEED = 42
TEST_SIZE = 0.2
SAMPLE_SIZE = 1000000
%matplotlib inline
# General
from mlTools import dataLoader, dataSplitter, dataExplorer, dataProcessor
import numpy as np
import pandas as pd
# Plotting
import matplotlib.pyplot as plt
import plotly.express as px
import phik
import seaborn as sns
from pandas.plotting import scatter_matrix
# RFM
from rfm import RFM
# CLV
from lifetimes.utils import summary_data_from_transaction_data
from lifetimes import BetaGeoFitter
from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix
from lifetimes.plotting import plot_period_transactions
from lifetimes.plotting import plot_cumulative_transactions
from lifetimes.plotting import plot_incremental_transactions
# Clustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
# Feature Extraction
from sklearn.decomposition import PCA
# Anomaly
from sklearn.ensemble import IsolationForest
# Preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Training & Validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Deployment
import pickle
loaderObj = dataLoader()
data = loaderObj.batch_loader("./data/data.parquet",False)
Now that the whole data is imported, let's perform the fixed split we mentioned before:
splitterObj = dataSplitter(data)
train,test = splitterObj.train_splitter("sales_net", TEST_SIZE, SEED, False)
train = train.sample(SAMPLE_SIZE)
test = test.sample(int(SAMPLE_SIZE*TEST_SIZE))
Our data is now imported and split as train and test, both as a pandas DataFrame.
Now, we will perform an Exploratory Data Analysis on the train.
NOTE: The split is random, not stratified, as the distribution of sales is not unbalanced. However, in other analyses on this same dataset (e.g. churn), a stratified split and sampling would be the way to go.
In this step of the Pipeline we will grasp the substance of our data for the first time: we will learn which features are present in our trainset, check their types, and see how their values are distributed.
This step is fundamental to set a strategy around the next step, Data Processing.
We will start with the basics. As mentioned earlier, we will make use of our own mlTools library, which uses pandas, numpy, matplotlib and others as backend.
In this basic overview, the output will be equivalent to using the pandas head(), info() and describe() methods.
categorical = ["order_channel","branch_id"]
numerical = ["product_id", "client_id", "sales_net", "quantity"]
explorerObj = dataExplorer(train, categorical, numerical)
explorerObj.basic_explorer("sales_net")
| date_order | date_invoice | product_id | client_id | sales_net | quantity | order_channel | branch_id | |
|---|---|---|---|---|---|---|---|---|
| 39721857 | 2018-12-26 | 2018-12-26 | 1366829 | 1905984 | 1.619200 | 11 | at the store | 3439 |
| 5891554 | 2017-12-01 | 2017-12-01 | 2512504 | 861335 | 0.000000 | 3 | at the store | 3417 |
| 44786026 | 2019-02-27 | 2019-02-27 | 2975027 | 69381 | 182.528000 | 31 | online | 20 |
| 16295413 | 2018-03-26 | 2018-03-27 | 1895362 | 1583937 | 0.414000 | 3 | online | 1888 |
| 31304247 | 2018-09-25 | 2018-09-25 | 1180925 | 1600012 | 63.970728 | 401 | by phone | 7101 |
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 39721857 to 50519451
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype
---  ------         --------------    -----
 0   date_order     1000000 non-null  object
 1   date_invoice   1000000 non-null  object
 2   product_id     1000000 non-null  int64
 3   client_id      1000000 non-null  int64
 4   sales_net      1000000 non-null  float64
 5   quantity       1000000 non-null  int64
 6   order_channel  1000000 non-null  object
 7   branch_id      1000000 non-null  int64
dtypes: float64(1), int64(4), object(3)
memory usage: 68.7+ MB
| product_id | client_id | sales_net | quantity | branch_id | |
|---|---|---|---|---|---|
| count | 1.000000e+06 | 1.000000e+06 | 1.000000e+06 | 1000000.000000 | 1000000.000000 |
| mean | 1.633697e+06 | 1.139933e+06 | 1.486738e+02 | 89.665476 | 5467.184475 |
| std | 9.191704e+05 | 6.553165e+05 | 1.895873e+03 | 818.111417 | 3176.086567 |
| min | 1.500000e+01 | 6.000000e+00 | -1.311241e+05 | 3.000000 | 20.000000 |
| 25% | 8.517370e+05 | 5.650640e+05 | 1.407600e+01 | 3.000000 | 2907.000000 |
| 50% | 1.625584e+06 | 1.152568e+06 | 4.425195e+01 | 5.000000 | 5226.000000 |
| 75% | 2.436220e+06 | 1.707522e+06 | 1.310755e+02 | 21.000000 | 8453.000000 |
| max | 3.238739e+06 | 2.274517e+06 | 1.740456e+06 | 192001.000000 | 11057.000000 |
0.000000 38340
2.290800 4809
4.609200 2582
1.159200 1735
0.414000 1525
...
311.932133 1
670.128000 1
684.065387 1
274.587156 1
-58.627000 1
Name: sales_net, Length: 243428, dtype: int64
Firstly, we see we only have 8 features. Some are categorical, others numerical. We must cast both date columns, from which we might derive interesting new features.
In the end, 8 features are not sufficient to build a robust Machine Learning model, so some Feature Engineering will be necessary to produce a useful one.
From this basic analysis we can also see some of the basic statistics on how the data is distributed.
The method returns some of the statistics behind a Boxplot, such as the quartiles. It also gives some information around the mean and standard deviation of each feature.
From this we can sense the range of values each feature spans. We can also check for possible outliers.
One likely source of outliers resides in the features sales_net and quantity, as their maximum values fall far outside the IQR. In the following steps, we will further analyze this possibility.
Finally, regarding the row count per attribute, as well as the type of each, the info() method comes in quite handy.
Summary
Features: ['date_order', 'date_invoice', 'product_id', 'client_id', 'sales_net', 'quantity', 'order_channel', 'branch_id']
Numerical: ['sales_net', 'quantity']
Categorical: ['product_id', 'client_id', 'order_channel', 'branch_id']
Dates: ['date_order', 'date_invoice']
From this analysis we also point out that some features are incorrectly formatted, as is the case of date_order and date_invoice.
Even if this step may be considered part of the processing stage, casting both features is fundamental to perform a more complete exploration. All these changes will also optimize the in-memory size of the dataset.
Aside from that, some other features should be cast to a more efficient format. We will perform all these casts in the following method:
def data_optimize(df):
    df.product_id = df.product_id.astype(np.int32)
    df.client_id = df.client_id.astype(np.int32)
    df.quantity = df.quantity.astype(np.int32)
    df.branch_id = df.branch_id.astype(np.int16)
    df.sales_net = df.sales_net.astype(np.float32)
    df['date_order'] = pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['date_invoice'] = pd.to_datetime(df['date_invoice'], format='%Y-%m-%d')
    df.order_channel = df.order_channel.astype('category')
    return df
train=data_optimize(train)
Let's see the effect of those changes:
Nice! That's more than a 40% decrease in memory usage! This will make the posterior analysis way more efficient.
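The saving can be verified with memory_usage(deep=True); a toy example of the same downcasting idea:

```python
import numpy as np
import pandas as pd

# Branch ids fit comfortably in 16 bits, so int64 wastes 6 bytes per value
df = pd.DataFrame({"branch_id": np.arange(1000, dtype=np.int64)})
before = df.memory_usage(deep=True).sum()

df["branch_id"] = df["branch_id"].astype(np.int16)
after = df.memory_usage(deep=True).sum()
```
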
From this method, we confirm the types we analyzed before.
Summary
Casting:
['date_order', 'date_invoice'] -> datetime64
['product_id', 'client_id', 'sales_net', 'quantity', 'branch_id'] -> optimized
Feature Engineering: ['data_order_year', 'data_order_month', 'data_order_dayofmonth', 'data_order_dayofweek']
But wait a second...
60+ million rows?
That's a massive dataset!
We may need to prepare ourselves to perform some sampling in the next steps of the EDA, or find out more efficient ways to import such a dataset.
The dataset is so huge that info() doesn't return full information regarding missing values.
Let's explore them with isnull():
print(train.isnull().sum())
date_order       0
date_invoice     0
product_id       0
client_id        0
sales_net        0
quantity         0
order_channel    0
branch_id        0
dtype: int64
Fortunately, there is only a single fully-NaN value, located in date_invoice. This might mean many things; however, we won't dwell on it. We can safely get rid of this single missing value, as it's not relevant enough.
Let's analyze our target feature, and see if we spot any anomalies:
display(explorerObj.outlier_explorer('sales_net'))
Indeed, there are some anomalies!
There seem to be negative values, which may look impossible. However, these may be the product of a refund or an incorrect invoice.
Our strategy will be to get rid of them, considering them erroneous. If our dataset had given us the chance to track down a payment and its refund, a more exhaustive analysis would have been done.
Summary
Missing Values: date_invoice -> Drop
Negative Values: sales_net-> Drop
We can also take a look at the categorical features, to check the different values we have.
This step is significant for planning a strategy for the later Encoding.
print(train["order_channel"].value_counts(), train["branch_id"].value_counts())
at the store 507997
by phone 401129
online 89498
other 912
during the visit of a sales rep 464
Name: order_channel, dtype: int64

3318    10238
1888 9010
4080 8381
5489 7621
6702 7146
...
807 2
2199 1
524 1
6468 1
7076 1
Name: branch_id, Length: 551, dtype: int64
We can easily check that One Hot Encoding is the way to go for order_channel, as the resulting increase in features won't be too heavy.
This is not the case for [branch_id, product_id, client_id]. We might be tempted to implement Target Encoding, but we will dismiss this possibility for now, as those features won't be useful for the model.
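A minimal sketch of One Hot Encoding with pandas get_dummies (the three sample channels are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({"order_channel": ["online", "by phone", "at the store"]})

# One binary column per channel; with only 5 distinct channels in the
# real data, the increase in feature count stays manageable.
encoded = pd.get_dummies(orders, columns=["order_channel"])
```
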
Summary
order_channel -> One Hot Encoding

Speaking of correlations, let's have a look at them.
As we currently have both numerical and categorical features, we can analyze the correlations (both linear and non-linear) with the phik_matrix() method.
corr_matrix = train.phik_matrix()
interval columns not set, guessing: ['product_id', 'client_id', 'sales_net', 'quantity', 'branch_id']
sns.set(font_scale=2)
fig, ax = plt.subplots(figsize=(50,50))
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax)
<AxesSubplot:>
We shouldn't fool ourselves: the correlations don't look promising for establishing a strategy. Little to no correlation is observed between the default features (dismissing the obvious duplicated dates).
As a Supervised Learning system, we can compare the correlations against a target, in this case sales_net. In this respect, and as obvious as it seems, quantity shows a strong correlation with sales_net.
We can, at least, see if there is any interesting interaction between the numerical features we described previously:
numerical = ["sales_net", "quantity"]
scatter_matrix(train[numerical], figsize = (10, 10));
From the scatter_matrix() we can see not only the distribution of each feature, but also the interactions between them.
Another interesting element to study is whether there is any observable seasonality in the trends of our transactions. Let's plot how the clients behaved during the years 2017-2019:
train_seasonality = train.reset_index().set_index('date_order')['sales_net'].resample(rule='MS').sum().to_frame()
train_seasonality.plot(figsize=(30, 20));
From the plot we can spot a pattern around October, where after some growth, sales slowly decline until February. From February to March there seems to be consistent growth, followed by a drop in April-May.
Then another growth in sales is experienced during June-August, followed by a drop, and the cycle begins again.
Being a Supervised Learning system, we should expect the use of classic Regression algorithms.
Even though it's not mandatory for the selected models, scaling our data is recommended. The reason is to prevent the model from compensating the trainable parameters for a significant scale difference between features.
There are no requirements in regards to anomalous data, and therefore, a simple StandardScaler() should be enough.
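A quick sketch of what StandardScaler does, on an illustrative two-column array with very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on wildly different scales (illustrative values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])

# Each column is centered to mean 0 and scaled to unit variance,
# so neither magnitude dominates the trainable parameters.
X_scaled = StandardScaler().fit_transform(X)
```
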
The concepts analyzed in this EDA stage, up until now, are very common. However, even if our goal is to construct a final Supervised Learning Binary Classifier, we must recall we still do not have our labels: we must create them.
In order to create them, we have to analyze the probability of Churn from various perspectives. In the steps that follow this EDA, we will discuss different perspectives that may guide us to our desired labels.
The analyses we will perform include:
Trend Analysis: analyze whether a customer is likely to churn if the delay in their last orders is higher than their own average.
RFM: a classic technique in Marketing Segmentation, RFM still may be useful to cluster customers depending on their Recency, Frequency and Monetary scores.
K-Means: another classic, in this case as an Unsupervised Learning model to cluster data into groups by similarity.
Isolation Forest: a model used to detect outliers, in order to track down those Churn instances that behave in an unconventional way.
The predictions from all these analyses will be merged into a single column, Churn, in a similar fashion to a VotingClassifier ensemble model, by manually setting Churn to 0 or 1 depending on the majority vote.
Here is a diagram showing the concept:

Let's analyze them one by one, as a preview of how they would work with our preprocessed sampled_train, before implementing them as a Feature Engineering step.
NOTE: We will heavily sample our train set, as the size of the data limits our ability to compute. The chosen sampling method is random, not stratified. This should be enough to get a sense of which features might explain our target's behaviour.
One of the most primitive ways to analyze whether a customer is likely to churn is to analyze the delay between orders: if the delay over the last k periods is greater than that customer's average, we may infer the customer is going to churn.
Let's see how we could implement this:
The steps are:
sort_values by client_id and date_order
create a prev_order feature
compute the delay between date_order and prev_order
compute avg_delay by averaging the delays per client (groupby)
if delay > avg_delay in the last k=2 orders, provided they were not made very recently, then Churn.

sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
sampled_train['date_order'] = pd.to_datetime(sampled_train['date_order'], format='%Y/%m/%d')
sampled_train = sampled_train.sort_values(['client_id', 'date_order'])
sampled_train['prev_order'] = sampled_train.groupby('client_id')['date_order'].shift(2)
sampled_train['delay'] = sampled_train['date_order'] - sampled_train['prev_order']
sampled_train['avg_delay'] = sampled_train.groupby('client_id')['delay'].transform('mean')
sampled_train['trend_pred'] = sampled_train['delay'].gt(sampled_train['avg_delay']).astype(int)
sampled_train['trend_pred'] = sampled_train.groupby('client_id')['trend_pred'].transform(lambda x: 1 if (x == 1).any() else 0)
num_orders = sampled_train.groupby('client_id').size()
last_order_date = sampled_train.groupby('client_id')['date_order'].last()
recent_clients = (num_orders < 2) & last_order_date.dt.year.eq(2019) & last_order_date.dt.quarter.eq(4)
sampled_train.loc[sampled_train['client_id'].isin(recent_clients[recent_clients].index), 'trend_pred'] = 0
sampled_train = sampled_train.drop(['prev_order', 'delay', 'avg_delay'], axis = 1)
sampled_train['trend_pred'] = sampled_train['trend_pred'].astype(np.uint8)
sampled_train['trend_pred'].value_counts()
0    5699
1    3560
Name: trend_pred, dtype: int64
One of our main pillars for the Capstone is launching an RFM Model.
But what is it? And how may this analysis be useful for our models?
Let's find that out!
Concept
RFM is a behavioural segmentation method to analyze customers, from transactional data.
The following is what RFM stands for:
Recency shows you how recently a customer bought from your store.
Frequency reflects how often a customer purchases from your brand.
Monetary value represents how much a customer usually spends with your store.
In a nutshell, using those 3 parameters we will be able to score and cluster clients, later used to launch specific campaigns to maximize sales_net.
Usefulness
Using the rfm package, the dataset will be enriched with distinct new features:
Features: -> [recency,frequency,monetary_value,r,f,m,rfm_score,segment]
Those new features may be useful both to predict sales, attending to the segment in which the client stands and to cluster clients.
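Conceptually, the scoring boils down to quantile-binning three per-client aggregates; here is a hand-rolled sketch with illustrative transactions (the rfm package adds segmentation on top of this):

```python
import pandas as pd

# Illustrative transaction log: one row per purchase
tx = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "days_ago":  [5, 40, 90, 2, 10, 20, 60, 30],
    "amount":    [10, 20, 5, 50, 60, 40, 8, 15],
})

rfm = tx.groupby("client_id").agg(
    recency=("days_ago", "min"),      # days since most recent purchase
    frequency=("client_id", "size"),  # number of transactions
    monetary=("amount", "sum"),       # total spend
)

# Score each dimension 1-3 by quantile; for recency, lower is better,
# so the labels are reversed. rank() breaks ties before binning.
rfm["r"] = pd.qcut(rfm["recency"], 3, labels=[3, 2, 1]).astype(int)
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["m"] = pd.qcut(rfm["monetary"], 3, labels=[1, 2, 3]).astype(int)
```
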
Example of Implementation
r = RFM(sampled_train, customer_id='client_id', transaction_date='date_order', amount='quantity')
From the generated DataFrame, r.rfm_table, we see how the clients were scored according to the recency, frequency and monetary value they brought to the company.
Aside from that, following a standard classification, clients were clustered into distinct groups:
Champions -> {R: 4-5, F: 4-5, M: 4-5}
Promising -> {R: 4-5, F: 1-2, M: 3-4-5}
Loyal Accounts -> {R: 3-4-5, F: 3-4-5, M: 3-4-5}
Potential Loyalist -> {R: 3-4-5, F: 2-3, M: 2-3-4}
New Active Accounts -> {R: 5, F: 1, M: 1-2-3-4-5}
Low Spenders -> {R: 3-4-5, F: 1-2-3-4-5, M: 1-2}
Need Attention -> {R: 2-3, F: 1-2, M: 4-5}
About to Sleep -> {R: 2-3, F: 1-2, M: 1-2-3}
At Risk -> {R: 1-2, F: 1-2-3-4-5, M: 3-4-5}
Lost -> {R: 1-2, F: 1-2-3-4-5, M: 1-2}
+ Info: https://github.com/sonwanesuresh95/rfm/blob/main/rfm/rfm.py
Now, this classification is standard, which means it is not company-specific and may not be suitable for ours. A further study of the way our clients behave should be done to drill down to the most suitable classification. However, for the moment, we will accept this standard as correct.
We can easily plot how clients are distributed according to the three parameters:
r.plot_rfm_histograms()
But a more useful visualization would be a Treemap, in which to see, proportionally, the percentage of each cluster in our dataset.
To implement it, we will use plotly.express.
labels = list(r.segment_table['segment'])
values = list(r.segment_table['no of customers'])
percentages = []
for value in values:
    percentages.append(int(value / sum(values) * 100))
r.segment_table["perc"] = percentages
r.segment_table = r.segment_table.drop("no of customers", axis = 1)
r.segment_table
| segment | perc | |
|---|---|---|
| 0 | Champions | 10 |
| 1 | Loyal Accounts | 16 |
| 2 | Low Spenders | 17 |
| 3 | Potential Loyalist | 4 |
| 4 | Promising | 6 |
| 5 | New Active Accounts | 1 |
| 6 | Need Attention | 5 |
| 7 | About to Sleep | 5 |
| 8 | At Risk | 18 |
| 9 | Lost | 12 |
fig = px.treemap(r.segment_table, path=r.segment_table.columns, values = percentages)
fig.show()
That's a better visualization!
We can clearly spot how the different clusters generated by the RFM analysis are distributed.
We can also try to plot each instance in a 3d projection:
enc = OrdinalEncoder()
r.rfm_table[['segment']] = enc.fit_transform(r.rfm_table[['segment']])
scaler = MinMaxScaler()
r.rfm_table[['recency', 'frequency', 'monetary_value']] = scaler.fit_transform(r.rfm_table[['recency', 'frequency', 'monetary_value']])
r.rfm_table["client_id"] = r.rfm_table["client_id"].astype('int64')
sampled_train = sampled_train.merge(r.rfm_table, on='client_id', how='inner')
sampled_train = sampled_train.drop(['r','f','m','rfm_score'], axis = 1)
fig = px.scatter_3d(r.rfm_table, x='frequency', y='recency', z='monetary_value',
color='segment')
fig.show()
Just for the sake of experimenting, we can try setting a threshold for Churn: if a client belongs to segment 6 or higher (either Lost, About to Sleep or At Risk), then it's Churn.
r.rfm_table['Churn'] = r.rfm_table['segment'].apply(lambda x: 0 if x < 6 else 1)
fig = px.scatter_3d(r.rfm_table, x='frequency', y='recency', z='monetary_value',
color='Churn')
fig.show()
Let's move on to another classic analysis for customer segmentation: CLV.
CLV (Customer Lifetime Value) is a technique used to predict the total value a customer will bring to a business over their relationship. It's based on the value brought by the customer and the duration of the relationship with the company.
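Before the probabilistic models below, the simple historical formula is worth keeping in mind: CLV is roughly average order value times purchase frequency times expected relationship duration. A toy computation with illustrative numbers:

```python
# Illustrative inputs: all three numbers are assumptions, not taken from the data
avg_order_value = 150.0   # mean sales_net per order
purchases_per_year = 12   # observed purchase frequency
expected_years = 3        # assumed relationship duration

# Historical CLV: value per order x orders per year x years
clv = avg_order_value * purchases_per_year * expected_years
# clv == 5400.0
```

The BG/NBD and Gamma-Gamma models below refine this by estimating the frequency and monetary terms probabilistically per client.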
Let's see its implementation with the lifetimes package:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
plt.rcParams["figure.figsize"] = (20,10)
lt_RFM = summary_data_from_transaction_data(sampled_train,
'client_id', 'date_order', 'sales_net', observation_period_end='2019-09-20')
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(lt_RFM['frequency'], lt_RFM['recency'], lt_RFM['T']);
plot_frequency_recency_matrix(bgf);
plot_probability_alive_matrix(bgf);
t = 3
lt_RFM['predicted_purchase'] = bgf.conditional_expected_number_of_purchases_up_to_time(
t, lt_RFM['frequency'], lt_RFM['recency'], lt_RFM['T'])
lt_RFM.sort_values(by='predicted_purchase').head()
| frequency | recency | T | monetary_value | predicted_purchase | |
|---|---|---|---|---|---|
| client_id | |||||
| 44899 | 0.0 | 0.0 | 725.0 | 0.0 | 0.00095 |
| 2024114 | 0.0 | 0.0 | 725.0 | 0.0 | 0.00095 |
| 1371668 | 0.0 | 0.0 | 725.0 | 0.0 | 0.00095 |
| 203944 | 0.0 | 0.0 | 725.0 | 0.0 | 0.00095 |
| 274520 | 0.0 | 0.0 | 725.0 | 0.0 | 0.00095 |
plot_period_transactions(bgf, max_frequency=100).set_yscale('log')
plot_incremental_transactions(bgf, sampled_train, 'date_order', 'client_id', 513, 363, freq='D');
summary_with_money_value = lt_RFM
returning_customers_summary = summary_with_money_value[summary_with_money_value['frequency']>0]
returning_customers_summary.head()
| frequency | recency | T | monetary_value | predicted_purchase | |
|---|---|---|---|---|---|
| client_id | |||||
| 252 | 1.0 | 161.0 | 347.0 | 169.01135 | 0.004269 |
| 7121 | 1.0 | 20.0 | 219.0 | 231.23740 | 0.004968 |
| 7652 | 1.0 | 212.0 | 297.0 | 58.05660 | 0.004674 |
| 10157 | 1.0 | 78.0 | 494.0 | 185.45912 | 0.003348 |
| 11469 | 1.0 | 79.0 | 689.0 | 34.84040 | 0.002623 |
returning_customers_summary[['monetary_value', 'frequency']].corr()
| monetary_value | frequency | |
|---|---|---|
| monetary_value | 1.000000 | -0.005316 |
| frequency | -0.005316 | 1.000000 |
from lifetimes import GammaGammaFitter
ggf = GammaGammaFitter(penalizer_coef = 0.001)
ggf.fit(
returning_customers_summary['frequency'],
returning_customers_summary['monetary_value']
)
<lifetimes.GammaGammaFitter: fitted with 1346 subjects, p: 3.08, q: 0.69, v: 6.88>
print(
"Expected conditional average revenue: %s, Average revenue: %s" % (
ggf.conditional_expected_average_profit(returning_customers_summary['frequency'],
returning_customers_summary['monetary_value']).mean(),
summary_with_money_value[summary_with_money_value['frequency']>0]['monetary_value'].mean()));
Expected conditional average revenue: 197.9877928977734, Average revenue: 175.87519716410674
returning_customers_summary
| frequency | recency | T | monetary_value | predicted_purchase | |
|---|---|---|---|---|---|
| client_id | |||||
| 252 | 1.0 | 161.0 | 347.0 | 169.011350 | 0.004269 |
| 7121 | 1.0 | 20.0 | 219.0 | 231.237400 | 0.004968 |
| 7652 | 1.0 | 212.0 | 297.0 | 58.056600 | 0.004674 |
| 10157 | 1.0 | 78.0 | 494.0 | 185.459120 | 0.003348 |
| 11469 | 1.0 | 79.0 | 689.0 | 34.840400 | 0.002623 |
| ... | ... | ... | ... | ... | ... |
| 2260903 | 1.0 | 318.0 | 354.0 | 110.142400 | 0.004407 |
| 2263434 | 1.0 | 226.0 | 291.0 | 76.217400 | 0.004734 |
| 2268557 | 1.0 | 474.0 | 548.0 | 30.879800 | 0.003550 |
| 2272758 | 8.0 | 555.0 | 634.0 | 122.753841 | 0.022653 |
| 2273574 | 3.0 | 290.0 | 378.0 | 31.899467 | 0.011826 |
1346 rows × 5 columns
bgf.fit(returning_customers_summary['frequency'], returning_customers_summary['recency'],
returning_customers_summary['T'])
print(ggf.customer_lifetime_value(bgf,
returning_customers_summary['frequency'],
returning_customers_summary['recency'],
returning_customers_summary['T'],
returning_customers_summary['monetary_value'],
time=5, # months
discount_rate=0.01).head(10))
client_id
252      35.971006
7121     47.139106
7652     18.677111
10157    15.547479
11469     1.572376
12251     2.557314
14147     2.887103
14188    51.875260
15024     0.636711
16720    35.133160
Name: clv, dtype: float64
plot_period_transactions(bgf, max_frequency=100).set_yscale('log')
plot_cumulative_transactions(bgf, sampled_train, 'date_order', 'client_id', 513, 363, freq='D');
Hooray!
We have now a set of grouped customers according to our RFM and also according to CLV!
Nonetheless, these groups may overlap with each other, and may even be counterintuitive to split.
For that reason, and for the sake of experimenting, let's see how K-Means would cluster our customers in comparison to our RFM analysis:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
sampled_train = sampled_train.drop(['date_order'], axis = 1)
sampled_train = sampled_train.astype(float)
sampled_train = sampled_train.dropna()
sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=100000).fit(sampled_train)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
This means that around 90% of the variance (less than 10% of the Sum of Squared Errors remaining) in our RFM Analysis may be explained with just 2 clusters!
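The elbow read-off is somewhat subjective; as a complementary sanity check, the silhouette score for k=2 can be computed with scikit-learn (two illustrative blobs stand in for the RFM features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs standing in for the RFM features
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(8, 1, (50, 3))])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)  # closer to 1 = well-separated clusters
```
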
Let's apply it:
k = 2
kmeans = KMeans(n_clusters = k, random_state = SEED, max_iter=100000).fit(sampled_train)
sampled_train['kmeans'] = kmeans.predict(sampled_train)
We must always keep in mind that, even though there is a clear overlap in this visualization, in different hyperplanes this might change. Let's see how it looks when plotted after PCA:
pca = PCA(k)
projected = pca.fit_transform(sampled_train)
scaler = MinMaxScaler()
projected = scaler.fit_transform(projected)
fig = plt.figure(figsize=(10,10))
plt.scatter(projected[:, 0], projected[:, 1],
c=sampled_train['kmeans'], edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('rainbow', 10))
plt.xlabel('PC1')
plt.ylabel('PC2');
For the sake of experimenting, we can go the extra mile: sometimes Churn can be an isolated, abnormal behaviour that may be detected via anomaly detection.
For that reason, we may include the anomalies detected by an Isolation Forest model as a new feature.
Let's see if we actually spot any:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
sampled_train = sampled_train.drop(['date_order'], axis = 1)
model = IsolationForest(random_state=SEED)
model.fit(sampled_train)
pred = model.predict(sampled_train)
sampled_train['is_anomaly'] = pred
# Map in a single pass: chained .replace(-1, 1).replace(1, 0) would zero everything
sampled_train['is_anomaly'] = sampled_train['is_anomaly'].replace({-1: 1, 1: 0})
sampled_train['is_anomaly'].value_counts()
0    9259
Name: is_anomaly, dtype: int64
Numerical: ['sales_net', 'quantity']
Categorical: ['product_id', 'client_id', 'order_channel', 'branch_id']
Dates: ['date_invoice', 'date_order']
Target: ['sales_net']
Missing Values: ['date_invoice'] -> Drop instance
Negative Values: ['sales_net'] -> Drop instance
Encoding: ['order_channel', 'segment']
Drop: ['date_invoice', 'product_id', 'order_channel', 'branch_id']
Numerical (Feature Engineering): ['data_order_year', 'data_order_month', 'data_order_dayofmonth', 'data_order_dayofweek', 'trend', 'segment', 'is_anomaly', 'cluster', 'ensemble']
Scaling: MinMaxScaler()

So, we've seen a lot in this train set. We saw the number of features, how they are distributed and the possible future combinations that might be useful.
Now it's time to take action, and implement the strategies we planned in the previous step!
Let's go on with the Data Cleaning checklist:
def drop_values(df, feature):
    # Assigning df[feature].dropna() back does not remove rows;
    # drop the rows where the feature is missing instead
    return df.dropna(subset=[feature])

train = drop_values(train, 'date_invoice')
def drop_erroneous(df, feature, threshold):
    return df[df[feature] > threshold]

train = drop_erroneous(train, 'sales_net', 0)
def encoder_processor(df, feature):
    df = pd.get_dummies(df, columns=feature)
    return df

train = encoder_processor(train, ['order_channel'])
def features_keeper(df, features):
    return df[features]

train = features_keeper(train, ['client_id', 'date_order', 'quantity', 'sales_net'])
def extract_dates(df):
    df['date_order'] = pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['data_order_year'] = df['date_order'].dt.year
    df = encoder_processor(df, ['data_order_year'])
    df['data_order_quarter'] = df['date_order'].dt.quarter
    df = encoder_processor(df, ['data_order_quarter'])
    return df

train = extract_dates(train)
def type_casting(df):
    df.quantity = df.quantity.astype(np.float32)
    df.sales_net = df.sales_net.astype(np.float32)
    df.data_order_year_2017 = df.data_order_year_2017.astype(np.uint8)
    df.data_order_year_2018 = df.data_order_year_2018.astype(np.uint8)
    df.data_order_year_2019 = df.data_order_year_2019.astype(np.uint8)
    df.data_order_quarter_1 = df.data_order_quarter_1.astype(np.uint8)
    df.data_order_quarter_2 = df.data_order_quarter_2.astype(np.uint8)
    df.data_order_quarter_3 = df.data_order_quarter_3.astype(np.uint8)
    df.data_order_quarter_4 = df.data_order_quarter_4.astype(np.uint8)
    return df

train = type_casting(train)
numerical = ['quantity', 'sales_net']
def numerical_scaler(df, numerical):
    '''
    Scales the numerical features with a StandardScaler()/MinMaxScaler()
    '''
    scaler = MinMaxScaler()
    df[numerical] = scaler.fit_transform(df[numerical])
    return df

train = numerical_scaler(train, numerical)
def trend_processor(df, periods=1):
    df = df.sort_values(['client_id', 'date_order'])
    # Shift within each client, so a client's first orders are not compared
    # against the previous client's dates
    df['prev_order'] = df.groupby('client_id')['date_order'].shift(periods)
    df['delay'] = df['date_order'] - df['prev_order']
    df['avg_delay'] = df.groupby('client_id')['delay'].transform('mean')
    df['trend_pred'] = df['delay'].gt(df['avg_delay']).astype(int)
    df['trend_pred'] = df.groupby('client_id')['trend_pred'].transform(lambda x: 1 if (x == 1).any() else 0)
    num_orders = df.groupby('client_id').size()
    last_order_date = df.groupby('client_id')['date_order'].last()
    # Reset the flag only for the affected clients (few, very recent orders),
    # not for the whole dataset
    recent_clients = (num_orders < periods) & last_order_date.dt.year.eq(2019) & last_order_date.dt.quarter.eq(4)
    df.loc[df['client_id'].isin(recent_clients[recent_clients].index), 'trend_pred'] = 0
    df = df.drop(['prev_order', 'delay', 'avg_delay'], axis=1)
    df['trend_pred'] = df['trend_pred'].astype(np.uint8)
    return df

train = type_casting(train)
train = trend_processor(train, 1)
def rfm_processor(df):
    df.quantity = df.quantity.astype(np.int64)
    r = RFM(df, customer_id='client_id', transaction_date='date_order', amount='quantity')
    r.rfm_table["client_id"] = r.rfm_table["client_id"].astype('int64')
    df = df.merge(r.rfm_table, on='client_id', how='inner')
    enc = OrdinalEncoder()
    df[['segment']] = enc.fit_transform(df[['segment']])
    df = df.drop(['r', 'f', 'm', 'rfm_score'], axis=1)
    df = numerical_scaler(df, ['recency', 'frequency', 'monetary_value'])
    df['rfm_pred'] = df['segment'].apply(lambda x: 0 if x < 6 else 1)
    df = df.drop('segment', axis=1)
    return df

train = type_casting(train)
train = rfm_processor(train)
def cluster_processor(df, k):
    kmeans = KMeans(n_clusters=k, random_state=SEED, max_iter=1000).fit(df)
    df['kmeans_pred'] = kmeans.predict(df)
    return df
train['sales_net'] = pd.to_numeric(train['sales_net'], errors='coerce')
train = type_casting(train)
train = train.drop(['date_order'], axis = 1)
train = train.dropna()
train = cluster_processor(train, 2)
def anomaly_processor(df):
    model = IsolationForest(random_state=SEED)
    model.fit(df)
    pred = model.predict(df)
    # IsolationForest returns -1 for anomalies and 1 for normal points;
    # map them to 1/0 in a single pass (chained replaces would zero everything)
    df['anomaly_pred'] = pd.Series(pred, index=df.index).replace({-1: 1, 1: 0})
    return df
train = anomaly_processor(train)
def churn_imputer(df):
    df['majority'] = df[['trend_pred', 'rfm_pred', 'kmeans_pred', 'anomaly_pred']].sum(axis=1)
    df['Churn'] = df['majority'].apply(lambda x: 1 if x >= 2 else 0)
    df = df.drop(['trend_pred', 'rfm_pred', 'kmeans_pred', 'anomaly_pred', 'majority'], axis=1)
    return df
train['Churn'] = 0
train = churn_imputer(train)
fig = px.scatter_3d(train, x='frequency', y='recency', z='monetary_value',
color='Churn')
fig.show()